CONTRAlign: Discriminative Training for Protein Sequence Alignment

Authors

  • Chuong B. Do
  • Samuel S. Gross
  • Serafim Batzoglou

Abstract

In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
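
The abstract describes alignment scoring built from features such as substitutions, gap penalties, and hydropathy. As a rough illustration of that general idea only, the Python sketch below scores each aligned residue pair as a weighted sum of features and finds the best global alignment by dynamic programming. It is not CONTRAlign's model: the weights in `WEIGHTS` are hand-picked placeholders rather than learned parameters, the `HYDROPHOBIC` grouping is an assumed coarse simplification, gaps use a single linear penalty, and no pair conditional random field training is performed.

```python
# Hypothetical sketch: feature-based pairwise alignment scoring.
# All names and numbers here are illustrative assumptions.

HYDROPHOBIC = set("AILMFVWC")  # assumed coarse hydrophobic grouping

# Placeholder weights; a trained model would learn these from example alignments.
WEIGHTS = {"identical": 2.0, "mismatch": -1.0, "both_hydrophobic": 0.5, "gap": -1.5}


def match_features(a, b):
    """Return the active features for aligning residue a with residue b."""
    feats = {"identical": 1.0} if a == b else {"mismatch": 1.0}
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        feats["both_hydrophobic"] = 1.0
    return feats


def score(feats):
    """Linear score: weighted sum of active features."""
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def align(x, y):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (best_score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    gap = WEIGHTS["gap"]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (best[i - 1][j - 1] + score(match_features(x[i - 1], y[j - 1])), "diag"),
                (best[i - 1][j] + gap, "up"),
                (best[i][j - 1] + gap, "left"),
            ]
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right cell to recover the alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Because the score is linear in the features, adding another signal (for example, a secondary-structure agreement feature when that annotation is available) only requires a new entry in `match_features` and a corresponding weight.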

Similar resources

A max-margin model for efficient simultaneous alignment and folding of RNA sequences

MOTIVATION The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction. In this work, we present RAF (RNA Alignment and Folding), an efficient algorithm for ...

Discriminative Structured Models for Biological Sequence Analysis a Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Making predictions is a key element in many computational biology applications: given a set of input biological sequences, use an inference procedure to generate some corresponding predicted output. The prediction process involves defining an appropriate scoring model for comparing alternative output predictions, developing efficient inference algorithms for choosing high-scoring outputs, and c...

Training Protein Threading Models using Structural SVMs

Protein threading is the problem of inferring the structure of a protein from its sequence by matching the sequence against a set of known structures. Unlike conventional sequence to sequence alignment tasks, alignment models for threading can exploit a rich set of features derived from the geometry of the known structure. To make use of these complex and interdependent features, we explore the...

Discriminative Pruning for Discriminative ITG Alignment

While Inversion Transduction Grammar (ITG) has regained more and more attention in recent years, it still suffers from the major obstacle of speed. We propose a discriminative ITG pruning framework using Minimum Error Rate Training and various features from previous work on ITG alignment. Experiment results show that it is superior to all existing heuristics in ITG pruning. On top of the prunin...

Learning to Align Sequences: A Maximum-Margin Approach

We propose a discriminative method for learning the parameters of linear sequence alignment models from training examples. Compared to conventional generative approaches, the discriminative method is straightforward to use when operations (e.g. substitutions, deletions, insertions) and sequence elements are described by vectors of attributes. This admits learning flexible and more complex align...

Publication date: 2006